modularize optimization internals #7401

ben-schwen · 2025-10-28T17:32:57Z

Closes Using Map instead of lapply turns GForce off #5336
Closes lapply GForce opt could work also without .SD #5032
Closes Move GForce tests to own script #4305
Towards GForce optimisation could be more smart #3815
Closes GForce as.double / as.numeric #2934
Closes benchmark regression #7404
tests (a lot of them)

Adds arithmetic for GForce as requested in #3815, but does not add support for blocks in j such as d[, j={x<-x; .(min(x))}, by=y].
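For illustration, a minimal sketch of the two shapes of j mentioned above. The toy table and the local name v are assumptions made for this example, not taken from the test suite. Whether the optimized path is actually taken depends on the installed data.table version; the results are the same either way.

```r
library(data.table)

# toy data, invented for this example
d = data.table(x = c(4, 1, 3, 2), y = c("a", "a", "b", "b"))

# arithmetic on grouped aggregates: the kind of j that #3815 asks GForce to handle
rng = d[, .(range = max(x) - min(x)), by = y]

# a block in j still evaluates correctly; it is just not covered by this PR's GForce path
blk = d[, j = {v <- x; .(min(v))}, by = y]

print(rng)
print(blk)
```

Running the first query with verbose=TRUE is one way to check which path a given version takes.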

codecov · 2025-10-28T17:51:29Z

Codecov Report

❌ Patch coverage is 99.62121% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 99.01%. Comparing base (f3b166b) to head (283ba85).

Files with missing lines	Patch %	Lines
R/test.data.table.R	95.23%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #7401      +/-   ##
==========================================
- Coverage   99.02%   99.01%   -0.02%     
==========================================
  Files          87       87              
  Lines       16754    16843      +89     
==========================================
+ Hits        16591    16677      +86     
- Misses        163      166       +3


github-actions · 2025-10-28T17:52:01Z

HEAD=modular_gforce slower P<0.001 for setDT improved in #5427

Generated via commit 283ba85

Download link for the artifact containing the test results: ↓ atime-results.zip

Task	Duration
R setup and installing dependencies	2 minutes and 55 seconds
Installing different package versions	45 seconds
Running and plotting the test cases	5 minutes and 11 seconds

man/test.Rd

R/data.table.R

ben-schwen · 2025-11-02T19:05:06Z

I'm also not sure about moving the tests to optimize.Rraw since this feels kind of wrong and not needed after introducing the new levels/optimization parameter to test.

NEWS.md

ben-schwen · 2026-01-07T10:51:29Z

@MichaelChirico I'm also not 100% convinced about the new optimize.Rraw. I guess the whole idea was that we could simply run the script multiple times with different optimization levels. This need was eliminated by adding the optimize parameter to test() which somehow feels cleaner.

Co-authored-by: Michael Chirico <[email protected]>

MichaelChirico · 2026-01-07T17:35:53Z

> @MichaelChirico I'm also not 100% convinced about the new optimize.Rraw. I guess the whole idea was that we could simply run the script multiple times with different optimization levels. This need was eliminated by adding the optimize parameter to test() which somehow feels cleaner.

I see. I still like the idea of a separate script -- the more we peel out of the behemoth tests.Rraw, the better. "eventually" it would be nice to have most tests live in purpose-made test scripts, IMO.

MichaelChirico · 2026-01-07T17:39:45Z

inst/tests/tests.Rraw

  test(2357.2, fread(paste0("file://", f)), DT)
 })
+
+# gforce should also work with Map in j #5336


one last idea -- what happens when the grouping column is part of the aggregation in j?

DT[, .(sum(b) - mean(a)), by=b]

When the grouping column is part of the aggregation we turn off GForce since it will be in .SDall

data.table/R/data.table.R

Lines 430 to 432 in 8129198

for (ii in seq.int(from=2L, length.out=length(jsub)-1L)) {
  if (!.gforce_ok(jsub[[ii]], SDenv$.SDall, envir)) {GForce = FALSE; break}
}
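To make the case concrete, here is a small sketch with toy data (my own, not from the test suite). Inside j the grouping column is a length-1 value per group, so sum(b) is that group's value of b rather than a sum over the group's rows. This is one reason the expression has to go through the regular, non-GForce evaluation.

```r
library(data.table)

# toy data, invented for this example
DT = data.table(a = c(1, 2, 3, 4), b = c(2L, 2L, 3L, 3L))

res = DT[, .(sum(b) - mean(a)), by = b]
# group b == 2: sum(b) is 2 (length-1 by value), mean(a) is 1.5
# group b == 3: sum(b) is 3,                     mean(a) is 3.5
print(res)
```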

R/data.table.R

MichaelChirico · 2026-01-07T17:49:38Z

R/data.table.R

+}
+
+# attempts to optimize j expressions using lapply, GForce, and mean optimizations
+.attempt_optimize = function(jsub, jvnames, sdvars, SDenv, verbose, i, byjoin, f__, ansvars, use.I, lhs, names_x, envir) {


really like how clean this is 👍

R/data.table.R

MichaelChirico

About halfway done reading the implementation now. Thanks for your patience with the review! I'm really excited for this to get finished :)

MichaelChirico · 2026-01-07T18:01:01Z

R/data.table.R

+
+# Optimize expressions using GForce (C-level optimizations)
+# This function replaces functions like mean() with gmean() for fast C implementations
+.optimize_gforce = function(jsub, SDenv, verbose, i, byjoin, f__, ansvars, use.I, lhs, names_x, envir) {


one thing that comes to mind seeing such a long signature -- using a "struct" instead of passing individual arguments, e.g.

https://stackoverflow.com/questions/31864162/what-are-the-pros-and-cons-of-using-a-struct-argument-v-s-multiple-parameters

There may be some possibility to make the code easier to understand if some arguments are grouped or combined.

Not a requirement but something to ponder.

Good point. I think if we used structs/lists, we should probably use them for all helpers here for consistency, i.e. also .optimize_sd_subset, .optimize_c_expr, .optimize_lapply, .optimize_gforce, .optimize_mean and .attempt_optimize.

For .optimize_gforce I can even see the benefit for the long signature, but on the other hand we run into the problem that arguments might get lost in there...
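A rough base-R sketch of what the "struct" variant could look like. Both new_opt_ctx and .uses_only_columns are hypothetical names made up for this illustration; only the field names echo arguments from the signatures above. It is not how the PR is implemented.

```r
# Hypothetical constructor: bundles shared state into one list and validates it once,
# which partly addresses the worry about arguments getting lost.
new_opt_ctx = function(SDenv, names_x, envir, verbose = FALSE) {
  stopifnot(is.environment(SDenv), is.character(names_x), is.environment(envir))
  list(SDenv = SDenv, names_x = names_x, envir = envir, verbose = verbose)
}

# Hypothetical helper with a (jsub, ctx) signature instead of many positional arguments.
# It only checks that every variable used in j is a known column.
.uses_only_columns = function(jsub, ctx) {
  vars = all.vars(jsub)
  ok = all(vars %in% ctx$names_x)
  if (ctx$verbose) message("variables in j: ", toString(vars), "; all columns: ", ok)
  ok
}

ctx = new_opt_ctx(SDenv = new.env(), names_x = c("a", "b"), envir = globalenv())
r1 = .uses_only_columns(quote(sum(a) - mean(b)), ctx)
r2 = .uses_only_columns(quote(sum(z)), ctx)
```

The trade-off is the one raised above: call sites get shorter, but each helper no longer documents in its signature which fields it actually reads.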

modular optimization paths - init

ebd152d

ben-schwen added 13 commits October 29, 2025 09:17

make linter happy

71b21ab

move tests

8a9e727

add lapply(list(col1, col2, ...), fun) pattern

04e5782

turn on optimization

a8dde19

add type conversion support to GForce

67f2874

remove stale branch

2876ebe

add tests

c445c38

update man

5410e31

merge tests

dece1c6

polish test fun

5e1789d

add arithmetic

62f1c48

add AST walker and update tests

c47ec27

add tests

1d324d6

ben-schwen marked this pull request as ready for review November 2, 2025 18:01

ben-schwen requested a review from MichaelChirico as a code owner November 2, 2025 18:01

ben-schwen added 2 commits November 2, 2025 19:30

Merge branch 'master' into modular_gforce

6b54c1e

add NEWS

22cf35e

jangorecki reviewed Nov 2, 2025

man/test.Rd

jangorecki reviewed Nov 2, 2025

R/data.table.R

ben-schwen mentioned this pull request Nov 2, 2025

benchmark regression #7404

Closed

jangorecki reviewed Nov 3, 2025

NEWS.md

ben-schwen added 5 commits November 3, 2025 09:45

make function name in massageSD more expressive

25a7e2e

rename levels argument to optimization

eb8056c

update docs

4544398

restore test nums

d40edb8

remove double tests

5e7efb7

ben-schwen added 3 commits January 7, 2026 10:54

update subsuming comments

371e246

add subsuming comments

e2694e1

finish double checking of moving tests

da771d4

ben-schwen and others added 6 commits January 7, 2026 11:53

make optimize more robust

af15282

Co-authored-by: Michael Chirico <[email protected]>

add comment about removing tests in benchmark.Rraw

b61f280

be clearer in NEWS

d8e34d3

add nocovs for errors

c5fb65a

add unwrapper for conversions

9f0e5cf

add more tests

8129198